13 research outputs found

    Memory-Efficient Recursive Evaluation of 3-Center Gaussian Integrals

    To improve the efficiency of Gaussian integral evaluation on modern accelerated architectures, FLOP-efficient Obara-Saika-based recursive evaluation schemes are optimized for memory footprint. For the 3-center 2-particle integrals that are key for the evaluation of Coulomb and other 2-particle interactions in the density-fitting approximation, the use of multi-quantal recurrences (in which multiple quanta are created or transferred at once) is shown to produce significant memory savings. Other innovations include leveraging register memory for a reduced memory footprint and direct compile-time generation of optimized kernels (instead of custom code generation) with compile-time features of modern C++/CUDA. High efficiency of the CPU- and CUDA-based implementations of the proposed schemes is demonstrated both for individual batches of integrals involving Gaussians with low and high angular momenta (up to L=6) and contraction degrees, and for the density-fitting-based evaluation of the Coulomb potential. The computer implementation is available in the open-source LibintX library. Comment: 37 pages, 2 figures, 6 tables

    Uncontracted Rys Quadrature Implementation of up to G Functions on Graphical Processing Units

    An implementation is presented of an uncontracted Rys quadrature algorithm for electron repulsion integrals, including up to g functions, on graphical processing units (GPUs). The general GPU programming model, the challenges associated with implementing the Rys quadrature on these highly parallel emerging architectures, and a new approach to implementing the quadrature are outlined. The performance of the implementation is evaluated for single and double precision on two different types of GPU devices. The performance obtained is on par with the matrix−vector routine from the CUDA basic linear algebra subroutines (CUBLAS) library.

    New Multithreaded Hybrid CPU/GPU Approach to Hartree−Fock

    In this article, a new multithreaded Hartree–Fock CPU/GPU method is presented which utilizes automatically generated code and modern C++ techniques to achieve a significant improvement in memory usage and computer time. In particular, the newly implemented Rys Quadrature and Fock Matrix algorithms, implemented as a stand-alone C++ library with C and Fortran bindings, provide up to 40% improvement over the traditional Fortran Rys Quadrature. The C++ GPU HF code provides approximately a factor of 17.5 improvement over the corresponding C++ CPU code.

    Distributed Memory, GPU Accelerated Fock Construction for Hybrid, Gaussian Basis Density Functional Theory

    With the growing reliance of modern supercomputers on accelerator-based architectures such as GPUs, the development and optimization of electronic structure methods to exploit these massively parallel resources has become a recent priority. While significant strides have been made in the development of GPU-accelerated, distributed memory algorithms for many-body (e.g. coupled-cluster) and spectral single-body (e.g. planewave, real-space and finite-element density functional theory [DFT]) methods, the vast majority of GPU-accelerated Gaussian atomic orbital methods have focused on shared memory systems, with only a handful of examples pursuing massive parallelism on distributed memory GPU architectures. In the present work, we present a set of distributed memory algorithms for the evaluation of the Coulomb and exact-exchange matrices for hybrid Kohn-Sham DFT with Gaussian basis sets via direct density-fitted (DF-J-Engine) and seminumerical (sn-K) methods, respectively. The absolute performance and strong scalability of the developed methods are demonstrated on systems ranging from a few hundred to over one thousand atoms using up to 128 NVIDIA A100 GPUs on the Perlmutter supercomputer. Comment: 45 pages, 9 figures

    Modernizing the core quantum chemistry algorithms

    This document covers the basics of computational chemistry and how, using modern programming techniques, the theory can be efficiently implemented on digital computers. The computer implementations are developed from the core two-electron integrals to many-body and coupled cluster algorithms. Particular attention is paid to the physical constraints of the computer resources and the emergence of novel architectures.

    High-performance evaluation of high angular momentum 4-center Gaussian integrals on modern accelerated processors

    We present a high-performance evaluation method for 4-center 2-particle integrals over Gaussian atomic orbitals with high angular momenta (l≥4) and arbitrary contraction degrees on graphical processing units (GPUs) and other accelerators. The implementation uses the matrix form of McMurchie-Davidson recurrences. Evaluation of the 4-center integrals over four l=6 (i) Gaussian AOs in double precision (FP64) on an NVIDIA V100 GPU outperforms the reference implementation of the Obara-Saika recurrences (Libint) running on a single Intel Xeon core by more than a factor of 1000, healthily exceeding the 73:1 ratio of the respective hardware peak FLOP rates while reaching almost 50% of the V100 peak. The approach can be extended to support AOs with even higher angular momenta; for low angular momenta, alternative approaches will be needed to achieve optimal performance. The implementation is part of the open-source LibintX library, freely available at github.com:ValeevGroup/LibintX

    New Multithreaded Hybrid CPU/GPU Approach to Hartree−Fock

    In this article, a new multithreaded Hartree–Fock CPU/GPU method is presented which utilizes automatically generated code and modern C++ techniques to achieve a significant improvement in memory usage and computer time. In particular, the newly implemented Rys Quadrature and Fock Matrix algorithms, implemented as a stand-alone C++ library with C and Fortran bindings, provide up to 40% improvement over the traditional Fortran Rys Quadrature. The C++ GPU HF code provides approximately a factor of 17.5 improvement over the corresponding C++ CPU code. Reprinted (adapted) with permission from Journal of Chemical Theory and Computation 8 (2012): 4166, doi:10.1021/ct300526w. Copyright 2012 American Chemical Society.

    Fast and Flexible Coupled Cluster Implementation

    A new coupled cluster singles and doubles with triples correction, CCSD(T), algorithm is presented. The new algorithm is implemented in object-oriented C++, has a low memory footprint, fast execution time, low I/O overhead, and a flexible storage backend with the ability to use either distributed memory or a file system for storage. The algorithm is demonstrated to work well on single workstations, a small cluster, and a high-end Cray computer. With the new implementation, a CCSD(T) calculation with several hundred basis functions and a few dozen occupied orbitals can run in under a day on a single workstation. The algorithm has also been implemented for graphical processing unit (GPU) architecture, giving a modest improvement. Benchmarks are provided for both CPU and GPU hardware. Reprinted (adapted) with permission from Journal of Chemical Theory and Computation 9 (2013): 3385, doi:10.1021/ct400054m. Copyright 2013 American Chemical Society.